Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609)#909
Open
sunnypatneedi wants to merge 26 commits into openai:main
Conversation
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) + LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gptq_calibrate(): collect Hessian H = X^T X via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
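As a rough illustration of the column-wise, Cholesky-compensated scheme this commit describes (a simplified sketch, not the PR's `gptq_quantize_weight()`; the damping constant and per-row max scales are assumptions):

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    # Column-wise quantization with error compensation via the
    # Cholesky factor of the inverse Hessian (GPTQ-style sketch;
    # damping and scale choices are assumptions, not the PR's code).
    W = W.clone().float()
    n = W.shape[1]
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    # damped Hessian -> upper-triangular factor of H^-1
    Hd = H + damp * H.diagonal().mean() * torch.eye(n)
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(Hd)), upper=True)
    Q = torch.zeros_like(W)
    for i in range(n):
        col = W[:, i]
        q = torch.clamp(torch.round(col / scale[:, 0]), -qmax - 1, qmax)
        Q[:, i] = q
        err = (col - q * scale[:, 0]) / Hinv[i, i]
        # push this column's rounding error onto not-yet-quantized columns
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q.to(torch.int8), scale
```

With an identity (uninformative) Hessian this degenerates to plain per-row rounding, which is one way a miscalibrated Hessian can make GPTQ no better, or worse, than the naive path.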
Bug 1: Function adapted MLP weights but never scored documents. All compute was wasted — no loss/bpb accumulation.
Fix: Rewrote as inplace_ttt_eval() with an apply-then-update loop: score each chunk first (accumulate bpb), then gradient-update the MLP projection.

Bug 2: Model left in the last document's adapted state after the function returned. This corrupted subsequent LoRA TTT evaluation.
Fix: Reset MLP weights to the originals after all documents.

Also: Made In-Place TTT and LoRA TTT alternatives (config switch) rather than sequential phases, since both produce val_bpb scores. Use INPLACE_TTT_ENABLED=1 for In-Place, =0 for LoRA TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
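The score-then-adapt loop with the end-of-eval reset can be sketched as follows (a minimal stand-in for `inplace_ttt_eval()`; the optimizer, the "mlp" name filter, and byte-level tokens for the bits-per-byte conversion are assumptions):

```python
import copy
import math
import torch
import torch.nn.functional as F

def inplace_ttt_eval(model, docs, lr=1e-4):
    # docs: iterable of 1-D token tensors. Score each document BEFORE
    # updating (Bug-1 fix), then restore the original weights at the
    # end so later evaluations see an unadapted model (Bug-2 fix).
    orig = copy.deepcopy(model.state_dict())
    params = [p for n, p in model.named_parameters() if "mlp" in n]
    opt = torch.optim.SGD(params, lr=lr)
    total_nll, total_tokens = 0.0, 0
    for doc in docs:
        x, y = doc[:-1], doc[1:]
        with torch.no_grad():                           # score first...
            logits = model(x.unsqueeze(0))
            total_nll += F.cross_entropy(
                logits.squeeze(0), y, reduction="sum").item()
            total_tokens += y.numel()
        logits = model(x.unsqueeze(0))                  # ...then adapt
        loss = F.cross_entropy(logits.squeeze(0), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.load_state_dict(orig)                         # reset adapted state
    return total_nll / total_tokens / math.log(2)       # nats -> bits
```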
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 results:
- Artifact 16.35MB (352KB over 16MB limit) — caused by GradQuant int7
- LoRA TTT took 1572s (2.6x over 600s budget) — 20 epochs too many
- Pre-quant val_bpb: 1.1757 (46 shards, not full 80)
- Post-quant sliding window: 1.3569

Fixes:
- GradQuant: top-10% sensitivity stays int6 (not int7)
- TTT epochs: 20 → 5 (should complete in ~400s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: Our GPTQ implementation (ported from PR openai#535) produced WORSE quantization than standard per-row int6. PR openai#486 base doesn't use GPTQ at all. Possible issues: bad Hessian calibration, numerical instability in the Cholesky decomposition, or a name mismatch between hooks and state-dict keys.

Fix: Disable GPTQ, revert to the standard quantization path. GPTQ code preserved for future debugging.

Also confirmed: the TTT bpb formula is algebraically correct. The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#548 UNMODIFIED (1.0865 proven). Reproduce baseline.
Run 1: PR openai#548 + LeakyReLU(0.5)^2 (1 line change). Measure delta.

Following retro lesson: baseline first, one change at a time. No GPTQ, no In-Place TTT, no XSA, no surprise gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… PR openai#548
Run 0: PR openai#414 UNMODIFIED (merged SOTA 1.1228, verified 3-seed)
Run 1: PR openai#414 + LeakyReLU(0.5)^2 (1 line change)

Baseline against verified numbers, not claimed scores from open PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95, 0.96, 0.97, 0.98, 0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
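The eval-time sweep amounts to rescaling logits before the cross-entropy (a minimal sketch; the function names are assumptions, not the PR's `eval_val_sliding`):

```python
import math
import torch
import torch.nn.functional as F

def bpb_at_temperature(logits, targets, T=1.0):
    # Cross-entropy in bits per token after temperature scaling:
    # dividing logits by T < 1 sharpens the predictive distribution.
    nll = F.cross_entropy(logits / T, targets, reduction="mean")
    return nll.item() / math.log(2)

def sweep(logits, targets):
    # Main eval at T=1.0, then the commit's sweep values.
    return {T: bpb_at_temperature(logits, targets, T)
            for T in (1.0, 0.95, 0.96, 0.97, 0.98, 0.99)}
```

Because this only rescales already-computed logits, the sweep is essentially free at eval time and costs nothing in training.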
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31 → 15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb). 8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
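The quantile-clip change can be sketched as below (assuming per-row scales; this is illustrative, not the PR's QAT code). The point of clipping at quantile(0.9995) of |W| rather than the row max is that a handful of outliers no longer inflate the quantization step for the whole row:

```python
import torch

def quantize_int5_quantile(W, q=0.9995, bits=5):
    # Per-row int5 quantization; the scale comes from a high quantile
    # of |W| instead of the row max, so rare outliers saturate rather
    # than widening the step size for every other weight.
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5
    clip = torch.quantile(W.abs(), q, dim=1, keepdim=True).clamp(min=1e-8)
    scale = clip / qmax
    Q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return Q.to(torch.int8), scale
```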
Template includes: - README.md with placeholder results table - submission.json with schema matching existing PRs - submit.sh helper to collect logs and extract metrics Fill in after successful runs, rename folder, PR to upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over 16MB even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).
Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#414 hardcodes `from flash_attn_interface import ...` (FA3/Hopper only). This pod has FA2 but not FA3. Added try/except + SDPA fallback in attention. Applied to all 4 runs (0-3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod has flash_attn 2.8.3 (from flash_attn import flash_attn_func) but NOT flash_attn_interface (FA3/Hopper). Added cascading import. Also keeping SDPA fallback for environments with no flash_attn at all. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
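The cascading import the two commits describe can be sketched like this (assuming the flash-attn 2.x/3.x function names; the FA3 interface's return signature varies across versions, so the tuple handling below is a defensive assumption):

```python
import torch
import torch.nn.functional as F

# Backend cascade: FA3 (flash_attn_interface, Hopper only)
# -> FA2 (flash_attn, e.g. 2.8.3) -> PyTorch SDPA.
try:
    from flash_attn_interface import flash_attn_func   # FA3
    _BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func         # FA2
        _BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        _BACKEND = "sdpa"

def attend(q, k, v):
    # q, k, v: [batch, seq, heads, head_dim] (flash-attn layout)
    if flash_attn_func is not None:
        out = flash_attn_func(q, k, v, causal=True)
        # some FA3 builds return (out, softmax_lse)
        return out[0] if isinstance(out, tuple) else out
    # SDPA expects [batch, heads, seq, head_dim]
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=True)
    return out.transpose(1, 2)
```

Keeping the SDPA branch means the same script runs on pods with no flash-attn wheel at all, at the cost of some speed.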
Run 0: PR openai#549 UNMODIFIED (merged SOTA 1.1194, verified 3-seed) Run 1: PR openai#549 + TTT_ENABLED=1 + TTT_LR=0.0005 (2 lines changed) Both have FA3→FA2→SDPA fallback for non-Hopper GPUs. Following retro: one change per run, baseline first. Expected: Run 1 should achieve ~1.094-1.104 (beats 1.1144 target). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrades TTT from PR openai#549's weak 3-epoch SGD (-0.0025 bpb) to PR openai#481's proven AdamW 30-epoch cosine + per-layer LR recipe (expected -0.01 to -0.025). Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post-TTT metrics from logs
- README.md + submission.json: Updated for the TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents "tensor does not have a device" error when torch.compile tries to recompile after TTT modified model weights. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x). 3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
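The per-layer LR part of that recipe maps naturally onto AdamW parameter groups. A minimal sketch, under assumed parameter names (`mlp.proj` at 3x, `mlp.fc` at 0.5x, everything else at 1x, as the commit states; the base LR and scheduler wiring are assumptions):

```python
import torch

def make_ttt_optimizer(model, base_lr=5e-4):
    # AdamW with per-layer LR multipliers plus a 30-epoch cosine decay,
    # following the multiplier table in the commit message.
    proj, fc, rest = [], [], []
    for name, p in model.named_parameters():
        if "mlp.proj" in name:
            proj.append(p)
        elif "mlp.fc" in name:
            fc.append(p)
        else:
            rest.append(p)
    opt = torch.optim.AdamW([
        {"params": proj, "lr": base_lr * 3.0},   # output projection: 3x
        {"params": fc,   "lr": base_lr * 0.5},   # input projection: 0.5x
        {"params": rest, "lr": base_lr},
    ])
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)
    return opt, sched
```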
…_bytes PR openai#771 was listed as "0 seeds" in the competition tracker because submission.json was missing the required `seeds` and `track` fields, and used `bytes_total` instead of the expected `artifact_bytes` field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
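A submission.json carrying the fields the tracker expects might look like this (values illustrative, taken from this PR's reported numbers; only the field names `seeds`, `track`, and `artifact_bytes` come from the commit):

```json
{
  "seeds": [42, 1337, 2025],
  "track": "track_10min_16mb",
  "artifact_bytes": 16670000,
  "val_bpb": 0.8609
}
```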
…hanced n-gram
- train_gpt_v10_safe.py: v9a + Hedge Mixer (multiplicative weights) + add-delta n-gram smoothing, dim=512
- train_gpt_v10_moonshot.py: model_dim=640 (42M params) + adaptive quant (ternary MLP / int4 attn / int6 embed) + Hedge Mixer
- auto_experiment.py: local CPU random search over 20 configs, logs to experiments.jsonl
- submit.sh: packaging and staging script for H100 runs
- PLAN.md: strategy doc with size estimates and run order

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
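A Hedge Mixer in the multiplicative-weights sense combines several next-token predictors (here, plausibly the neural model and the n-gram cache) and reweights them online by how well each predicted the observed token. A hedged sketch, with `eta` and all names being assumptions rather than the PR's code:

```python
import torch

def hedge_mix(expert_probs, targets, eta=0.1):
    # expert_probs: [experts, seq, vocab] next-token distributions;
    # targets: [seq] observed tokens. At each step, mix with the
    # current weights, then apply a multiplicative-weights update
    # rewarding each expert by the probability it gave the true token.
    E, S, V = expert_probs.shape
    w = torch.ones(E) / E
    mixed = torch.empty(S, V)
    for t in range(S):
        mixed[t] = w @ expert_probs[:, t, :]
        p_true = expert_probs[:, t, targets[t]].clamp(min=1e-9)
        w = w * p_true ** eta           # Hedge update: w_i *= p_i(x_t)^eta
        w = w / w.sum()
    return mixed
```

Because the update is multiplicative, a consistently better expert's weight grows geometrically, so the mixture tracks it after a short burn-in.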
- validate_configs.py: CPU-only artifact size estimator for moonshot configs (no GPU/data needed)
- experiments.jsonl: 20 initial random search results from auto_experiment.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude/peaceful-mclean
v10 moonshot: ternary MLP quant + scaled model + hedge mixer + enhanced n-gram
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616). All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive alpha and Hedge Mixer on PR openai#549 base architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive experiment tracking and moonshot submissions
11-gram Eval Cache + Hedge Mixer on PR #549 Base
val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing
The n-gram eval cache provides a -0.284 bpb improvement — the single largest gain over the base model. It replaces TTT entirely, freeing the full eval-time budget.
order_centers = 3.0 - 0.25 * (matched_order - min_order)

N-gram Protocol
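One way the entropy-adaptive alpha could combine the model and the n-gram cache, using the order_centers formula above (a sketch: the sigmoid parameterization of alpha is an assumption, not the PR's exact mixing rule):

```python
import torch

def mix_ngram(model_probs, ngram_probs, matched_order, min_order=3):
    # model_probs, ngram_probs: [..., vocab]; matched_order: [...]
    # length of the longest n-gram match found for each position.
    # Entropy of the model's next-token distribution, in nats:
    ent = -(model_probs * model_probs.clamp(min=1e-9).log()).sum(-1)
    # Longer matches lower the entropy threshold (order_centers formula),
    # so the cache is trusted at lower model uncertainty.
    center = 3.0 - 0.25 * (matched_order - min_order)
    alpha = torch.sigmoid(ent - center)   # high entropy -> lean on n-gram
    return (1 - alpha).unsqueeze(-1) * model_probs \
        + alpha.unsqueeze(-1) * ngram_probs
```

The qualitative behavior is the stated one: when the model is uncertain (high entropy) and a long n-gram match exists, the mixture leans heavily on the cache; when the model is confident, it mostly ignores it.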
Run Config
cd /workspace/parameter-golf
SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py

All hyperparameters are baked into the script as defaults. Key environment variables:
Timing Budget
Training Architecture (from PR #549 SOTA)
Ablation
Credits